Graph Grammar Based Web Data Extraction

Authors

  • Amin Roudaki
  • Jun Kong
Abstract

Web data extraction has been an active research topic since the invention of the World Wide Web, because the sheer volume of information on the Web makes it challenging to retrieve useful information. Because Web sites differ widely in how they design and present information, it is hard to implement a general solution that extracts data across sites. This paper presents a novel graph-grammar-based method that extracts the same type of information from different Web sites without training or adjustment. Our approach formalizes a common Web pattern as a graph grammar. Then, based on the visual layout and the HTML DOM structure, a Web page is abstracted as a spatial graph that highlights the essential spatial relations between information objects. Guided by the defined graph grammar, spatial parsing is performed on the spatial graph to extract structured records. We evaluated our approach on twenty-one Web sites and achieved an F1-score of 97.49%, which indicates promising performance.
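
The pipeline described in the abstract (abstract a page into a spatial graph, then match a pattern against it) can be illustrated with a small Python sketch. This is not the authors' graph grammar formalism or parser. The object kinds ("image", "title", "price"), the relation names ("left_of", "above"), the 20-pixel adjacency gap, and the single hard-coded production are all hypothetical choices made only for illustration, and the bounding boxes are invented sample data rather than output from a real page.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """A rendered information object: a kind plus its bounding box in pixels."""
    name: str
    kind: str  # hypothetical labels: "image", "title", "price"
    x: float
    y: float
    w: float
    h: float


def relation(a, b, gap=20):
    """Return a (hypothetical) spatial relation from a to b, or None if not adjacent."""
    overlap_v = a.y < b.y + b.h and b.y < a.y + a.h  # share some vertical extent
    overlap_h = a.x < b.x + b.w and b.x < a.x + a.w  # share some horizontal extent
    if overlap_v and 0 <= b.x - (a.x + a.w) <= gap:
        return "left_of"
    if overlap_h and 0 <= b.y - (a.y + a.h) <= gap:
        return "above"
    return None


def build_spatial_graph(boxes):
    """Map each ordered pair of adjacent objects to its spatial relation."""
    edges = {}
    for a, b in permutations(boxes, 2):
        rel = relation(a, b)
        if rel is not None:
            edges[(a.name, b.name)] = rel
    return edges


# Assumed production: Record := image -left_of-> title -above-> price
PRODUCTION = [("image", "left_of", "title"), ("title", "above", "price")]


def parse_records(boxes, edges):
    """Find every chain of two edges that matches the assumed production."""
    kind = {b.name: b.kind for b in boxes}
    records = []
    for (src, mid), rel1 in edges.items():
        if (kind[src], rel1, kind[mid]) != PRODUCTION[0]:
            continue
        for (mid2, dst), rel2 in edges.items():
            if mid2 == mid and (kind[mid2], rel2, kind[dst]) == PRODUCTION[1]:
                records.append({"image": src, "title": mid, "price": dst})
    return records


# Invented layout: two product records stacked vertically.
boxes = [
    Box("img1", "image", 0, 0, 100, 100),
    Box("title1", "title", 110, 0, 200, 30),
    Box("price1", "price", 110, 40, 200, 30),
    Box("img2", "image", 0, 150, 100, 100),
    Box("title2", "title", 110, 150, 200, 30),
    Box("price2", "price", 110, 190, 200, 30),
]
edges = build_spatial_graph(boxes)
records = parse_records(boxes, edges)
```

Because the pattern refers only to object kinds and relative positions, the same production would match any page whose records share this layout, whatever its markup. That is the intuition behind extracting without per-site training, though a real grammar would need many more productions and a proper parser.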

Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

On Automatic Information Extraction from Large Web Sites

Information extraction from Web sites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the ...

Inferring the Structure of Graph Grammars from Data

Graphs can be used to represent such diverse entities as chemical compounds, transportation networks, and the world wide web. Stochastic graph grammars are compact representations of probability distributions over graphs. We present an algorithm for inferring stochastic graph grammars from data. That is, given a set of graphs that, for example, correspond to a set of chemical compounds, all of ...

CETUS - A Baseline Approach to Type Extraction

The concurrent growth of the Document Web and the Data Web demands accurate information extraction tools to bridge the gap between the two. In particular, the extraction of knowledge on real-world entities is indispensable to populate knowledge bases on the Web of Data. Here, we focus on the recognition of types for entities to populate knowledge bases and enable subsequent knowledge extraction...

Wrapper Maintenance

A Web wrapper is a software application that extracts information from a semi-structured source and converts it to a structured format. While semi-structured sources, such as Web pages, contain no explicitly specified schema, they do have an implicit grammar that can be used to identify relevant information in the document. A wrapper learning system analyzes page layout to generate either gramm...

Publication date: 2011